Variations in Substitution Rate in Human and Mouse Genomes 
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We present a method to quantify spatial fluctuations of the substitution rate on different length 
scales throughout genomes of eukaryotes. The fluctuations on large length scales are found to be 
predominantly a consequence of a coarse-graining effect of ffuctuations on shorter length scales. This 
is verifled for both the mouse and the human genome. We also found that both species show similar 
standard deviation of fluctuations even though their mean substitution rate differs by a factor of 
two. Our method furthermore allows to determine time-resolved substitution rate maps from which 
we can compute auto-correlation functions in order to quantify how fast the spatial fluctuations in 
substitution rate change in time. 



The detailed knowledge of the mechanisms of muta- 
tions in living organisms is of fundamental importance 
for understanding the evolution of genomes. On the ba- 
sis of development of every organism is the cell repro- 
duction cycle and mutations can be seen as errors made 
in DNA during the process of chromosome repUcation. 
Mutations occurring in germ-line cells are inherited and 
passed onto the next generation. A large fraction of these 
mutations are substitutions of single nucleotides (G,C,A 
or T) by another while other mutations are due to inser- 
tions or deletions of one or more nucleotides in the DNA 
sequence. In recent time there is growing acceptance that 
the substitution rate is not spatially constant inside the 
genomes of mammals QjHSSIEIflSIll- This is a sur- 
prising result at first sight, as nucleotide substitutions re- 
sulting from copying errors should be fairly independent 
on the actual position, at least on large length scales. 
Unfortunately, this picture is too simple, as there exist 
strong correlations in the mammalian genome between 
mutation rate and other evolutionary rates (e.g. recom- 
bination rate) 0, 0, 0| . Although the origin of the sub- 
stitution rate bias in genomes of mammals is unknown, it 
is highly important to quantify the length and time scales 
where changes are occurring. This is because the amount 
of conserved sequences between the genomes of different 
species depends crucially on the fluctuations of the local 
mutation rate. But these conserved sequences give the 
best insight how much of the sequence in genomes has 
function and is therefore under selective constraint. 

Here, we present a method to calculate substitution 
rates in genomes of eukaryotes. On grounds of our 
method we make use of interspersed repeats . 
Interspersed repeats are sequences of 3 ■ 10^ — 5 • 10'^ base 
pairs in length whose copy was inserted up to 10^ times 
in the human genome. Each copying machinery has only 
worked for a short time in the history of the organism 
and from that time on, the copies of a specific repeat 
type have accumulated substitutions and other muta- 
tions. Due to the large number of copies the original 
sequence can be reconstructed quite accurately in most 
cases and differences between a given copy (repetitive 



element) of an interspersed repeat and its consensus se- 
quence allows to estimate the mutation rate at the posi- 
tion of this repetitive element if the time is known when 
the coping machinery was active (c.f. Revs the 
human genome there have been classified more than 300 
different interspersed repeats which occupy in total more 
than 40% of the genome, which gives us a large set of 
sequences at hand which is most likely under no selective 
constraint. We visualize interspersed repeats as "mea- 
surement devices" for the underlying local substitution 
rate. This requires to solve three major problems: (i) 
single repetitive elements are usually too short and have 
on average accumulated too few substitutions to give a 
reliable estimate for the substitution rate at their posi- 
tion in the genome, (ii) the repetitive elements show a 
very broad length distribution (implying that our "mea- 
surement devices" differ in their " sensitivity" , which is 
proportional to their length), and (iii) the repetitive el- 
ements belonging to different types differ in general in 
their age (thus, the measurement devices have been mea- 
suring over different periods of time). Now the major 
theoretical task is to derive from the large amount of 
repetitive elements, which have large diversity in their 
age and sensitivity, reliable information about the un- 
derlying substitution rate at different positions and at 
different times in the genome. 

For our analysis we take a set of M different types of 
interspersed repeats (M = 200 — 300). Then, for inves- 
tigating variations in the substitution rate on different 
length scales we divide each genome (mouse and human) 
into Z equally sized partitions (7 = 1, Z). The total 
number of bases of all repeats of type a in the partition 
7 is denoted by Na^ and the corresponding total number 
of base mismatches to the consensus sequence originat- 
ing from single nucleotide substitutions is given by k^j. 
One may consider the quantity N^-^ as a measure for 
the sensitivity of the measuring device, which are repet- 
itive elements of the same type a located in partition 
7. Next, we define for each partition (7) and each re- 
peat type (a) the average divergence, Qa-y, which is the 
average probability to observe a base mismatch to the 
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FIG. 1: Local substitution rate of the human genome for 
differently sized partitions. Vertical lines at the bottom divide 
the genome into the 22 chromosomes. The inset shows the our 
results for chromosome 22 (solid line) in comparison with the 
result of Ref. (dashed line). 



consensus sequence. For statistically independent substi- 
tutions, the probability to find k^^ mismatches of bases 
in interspersed repeats of type a located in the partition 
7, is given by 
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In our case the conditional probability distributions 
P{kay\Na-y), P{Qa-y\Na-y) are uniform for < k^-y < 
Na-y, < Qoi-y < 1. So, within these limits, we can em- 
ploy Bayes' theorem to write Yia -) -^iQa^ylka^y, Na^y) 
Yia -f -^i^a-ylQa-y, Nay) This mcaus, that given the sets 
Naj, kaj we can compute a probability distribution for 
Qa-y- To obtain the most probable values {Qa-y} for the 
variables {Qaj} we have to fulfill the necessary condition 
for a maximum 
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and a^^^ J2a,-y^^'^PiQai\kay, Na^) < for Qay = Q*^. 

The equations (0) form a set of non-linear equations. For- 
tunately one can give a good estimate for values Q* ^ and 
thus a few iterative steps using Newton-Rapson method 
are sufficient to determine the Maximum Likelihood val- 
ues of the joint probability distribution, Eq. (QJ. To gain 
information from the quantities Q* ^ about the quantity 
we are really interested in, the local substitution rate 
in partition 7, denoted by m-y(t), we now introduce a 
microscopic model for base substitutions. Statistically 
independent changes of a base at a given position in the 
genome at time t can be described by the following Mas- 
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FIG. 2: Standard deviation, a a, of the local fluctuations of 
the substitution rate versus the size of the partitions for the 
human and mouse genome. The inset shows the normalized 
standard deviations, o-a/^/Z- The normalization, V^, is cho- 
sen such that the standard deviation, a a, would be constant 
for all partition sizes if Ty were stochastically independent. 
Filled squares represent results of a reference calculation to 
test the error of the method (see text). 



ter equation 



dtPb{t) = - ^ Wbb' {t) Pb' {t) 



(3) 



with b, b' e {A, T, G, C} and pb{t) the probabihty of ob- 
serving the base b at that site at time t. The transition 



matrix w has the elements Wbb' 
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(t) qbb' for b ^ b' 



_ and Wbb — X)b' ^-yit) Qbb' [13 ■ Here, the elements of the 
matrix q, qbb> , are the transition probabilities that a base 
b' mutates to a base b whenever a substitution occurs. 
The probability that a certain base of a repetitive ele- 
ment of type a still coincides with its corresponding base 
in the consensus sequence after the time ta is then given 
by p{ta)-p{0) (p is a vector with elements pb). The time 
ta denotes the time distance from today to the moment 
when the interspersed repeat of type a was inserted into 
the genome for the first time. We emphasize that near- 
est neighbor effects have impact onto the substitution 
rate pattern (c.f. Ref. ^3) but seem not to dominate 
the fiuctuations in substitution rate on the large length 
scales considered . In the following we make the simpli- 
fying approximation, called Jukes-Cantor approximation 
[l3j, which amounts in taking an uniform substitution 
pattern, qbb' — 1/4 for 6 ^ 6'. This is clearly a crude ap- 
proximation as, e.g., the transitions A : T G : C occur 
about a factor 3 — 4 more frequently than other substi- 
tution events [l^ Il5l|. This approximation is justified a 
posteriori, by performing Monte Carlo simulation of syn- 
thetic data sets which include this high asymmetry in the 
substitution pattern. Within our stochastical model, Eq. 
(PI , the mean divergence per base of a repetitive element 
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of type a in partition 7 is given by 



exp 



-A / mj(t')dt' 
Jo 



(aA-p(O)f (4) 



where a;^ denotes an eigenvector of the matrix q and A 
its eigenvalue. We define the genome-wide average sub- 
stitution rate by m{t) = 1/^X^7=1 ™7(0 the spatial 
deviations from this rate as Tj(t) — my{t) — m{t). We 
further introduce the time-averaged mean substitution 
rate by rha = 1/ta J^" 'm{t')dt' and the corresponding 

deviations as fa^ = 1/ta Jq" Tj{t')dt' . With these ab- 
breviations, the argument in the exponential of Eq. Q 
reads fhata + fa^ta = Jq" rn-y(t')dt'. It is clear that we 
can obtain only the time-averaged quantities {rha, Ta-y} 
from the the knowledge of Q^^- 

The data set for interspersed repeats we use in our 
analysis is taken from the UCSC Genome Bioinformatics 
Site . This data set was created using RepeatMarker 
together with the consecutive sequences from the Rep- 
Base database 0. The repetitive elements can be di- 
vided into lineage-specific repeats (defined as those in- 
troduced by transposition after the divergence of human 
and mouse) and ancestral repeats (defined as those al- 
ready present in the common ancestor). In following we 
analyse ancient repeats, to calculate the time averaged 
spatial fluctuations in substitution rate since the time 
of divergence of these two species. By taking ancient 
repeats from a sufficiently narrow time window we can 
make the approximations fj w t^q, m = m{t) and thus 
m Pi rha- We set the time-scale by defining the average 
genome-wide substitution rate of human as friH = 1, re- 
sulting in an average substitution rate for mouse given 
by rfiM — 2.05ifiH for ancient repeats (c.f. Ref. Q). As 
start values for the Newton iteration scheme we take — 

and ta = -l/(4m)ln[l - (4/3) ^7=0 kaj/E^Na^]- 
Thus depends on the Z + M parameters, {ta} and 
{f^,} which can be determined to high accuracy from Eq. 
^ as shown in Fig. for the human lineage using 
two differently sized partitions. The standard deviations 
of these fluctuations, cta = (X]7=i ^7)^^^' is shown in 
Fig. |5J). The local substitution rates show significant 
correlation with neighboring partitions only on length 
scales smaller than 5 • 10^ base pairs. On larger length 
scales the variations in substitution rates result mostly 
from coarse-graining of statistical independent partitions 
of smaller size as can be shown by rescaling the standard 
deviation, a a, by Z^^/^ (c.f. inset of Fig. (0). If the 
fluctuations were statistically independent for the high- 
est spatial resolution, Z = Zmax, then Z~^^'^aA would 
be constant for all Z < Z„iax and this seems to be the 
case for partitions of size > 5 Mbp's {Z < 500). By com- 
paring the variations in substitution rate between mouse 
and human we find identical standard deviations, a a, for 
both species, Fig. ||2Jl. This is a very surprising result 
as the mean substitution rate differs by a factor two be- 



tween both species and the genome of mouse is about 
14% smaller than that of human. But this agreement in 
the absolute magnitude could be accidental. We checked 
the bias due to the choice of repeat types by building 
randomly subsets of interspersed repeats which consist 
just of half the total number of repeat types used and 
repeating the calculations. We also investigated the sta- 
tistical errors of our genome data set {ka-f} by randomly 
creating such a data set by Eq. Q), using the readily de- 
termined values {^q} and {f^} from the true genome data 
and accounting for the large asymmetry in the substitu- 
tion pattern. The standard deviation of the combined 
error in determining the values a a is shown in Fig. (0) 
by the error bars for four different partitions. This vali- 
dates our method and demonstrates that the stochastic 
model, Eq. JQ), is appropriate. 

So far, we have computed f-y for a specific time- window 
for the class of youngest ancient repeats. Including 
also the lineage-specific repeats, we can repeat our op- 
timization procedure but now with time windows includ- 
ing all repetitive sequences but grouped in eight differ- 
ent equally distant age classes. We then approximate 

^7*'' — /o''^7(^') ~ '^"7 interspersed repeats, a, 

which belong to time window i, whose mean time dis- 
tance from today is given by ti. We recall that f\^'' is a 
time-averaged quantity, so it does not reflect the strength 
of fluctuations of the substitution rate as found today in 
the human and mouse genomes. The reason for f\^^ being 
different from T^{ti) is that by the reorganizations within 
the chromosomes, i.e. by insertions and deletions of se- 
quences, the local substitution rate can change in time. 
It is then clear that the substitution rate fluctuations 
as shown in Fig. Q are significantly smaller in ampli- 
tude than the true (i.e. actual) variations in substitution 
rate, T-y(t). To give a good estimate of the magnitude 
of the true fluctuations we have to include the effect of 
genome reorganizations in our model. Within a certain 
large partition 7 of size > lOMbp the number of repeats 
belonging to the same time window is sufficiently large 
and substitution rate across this partition is the result of 
coarse graining over almost independent fluctuations on 
smaller length scale. Therefore we can employ the cen- 
tral limit theorem to predict that distribution of T^{t) for 
7 will be close to a Gaussian distribution. This gets sup- 
ported by our analysis of ancient repeats, c.f. Fig. 
On these length scales we can also expect that the under- 
lying process which changes the local substitutions rate is 
Markovian as the correlation length between different lo- 
cal reorganizations in the genome can be assumed to be 
much smaller than partition size. Then, assuming this 
process to be quasi-stationary, one is tempted to write 
for the actual variation of the local mutation rate, Tj{t), 
within a continuous time model 



dtT^{t) = a(t)T^{t) + ri^{t) 



(5) 



This is because by Doob's theorem it is essentially the 
only process satisfying the conditions stated above. Here, 
a{t) is slowly varying, reflecting changes in the average 
mutation rate and rj^ (t) is chosen to be white noise with 
zero mean. The auto-correlation function of the process, 
Eq. is exponentially decaying. Thus, we obtain for 
the auto-correlation function of the time-averaged local 
substitution rate, C(ti,tj) = 
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with (7^ = ^/Z '^^^gTj{t)Tj{t) the variance of the fluc- 
tuations of actual substitution rate and a = a(t). Taking 
the fit parameters, a and a, time-independent is clearly 
an approximation but might be not a bad one for the 
human species (c.f. Ref. H Ei). Fig. © shows the 
correlation function, tt'C{t,t'), for the human lineage. 
The free parameters a = oiZ) and a in Eq. (|SJ) are ob- 
tained by a least mean square fit from the data. For 
the partition sizes resulting from Z — {50, 100, 370} we 
obtain the values a = 55 ± 15 and by assuming a = 55 
we obtain from a second fit cr = {0.033} [0.033, 0.035] 
{Z = 50); cr = {0.043}[0.040, 0.040] (Z = 100); ct = 
{0.067} [0.067, 0.071] {Z ^ 370). The values in the brack- 
ets are obtained from the values for a of the two other 
partition sizes multiplied with the scaling factor of the 
two OA'S, from the corresponding partitions sizes as given 
in Fig. © e.g. ct(Zi) = a a{Zx)o{Z2) ja a(Zi)- Thus the 
standard deviation for the fluctuations in substitution 
rate is about a factor 1.7 larger as found in the analysis 
using ancient repeats (c.f. Fig. (jSJ). The time when the 
correlations have decayed to its maximum value for 
these partitions is about 1/3 of the time since mouse and 
human have diverged. This in turn gives us an impression 
on which times-scales genome reorganizations alter local 
substitution rates in the human genome. We emphasize 
that our method can not resolve substitution rates on ar- 
bitrary small length scales as our "measurement devices" 
(repetitive elements) get shifted by insertions and dele- 
tions in other partitions over time and report therefore a 
spatially averaged substitution rate. 

In conclusion we have shown that fluctuations in sub- 
stitution rate on large length scales arise predominantly 
from a coarse-graining process of fluctuations on smaller 
length scales. Significant correlations with neighboring 
partitions are found for the human and mouse genomes 
only on length scales smaller than 5 • 10^ base pairs (c.f. 
inset Fig. |2l). Moreover, both species show remarkable 
similarity in the the standard deviation of the fluctua- 
tions in substitution rate on all length scales considered. 




FIG. 3: The time averaged correlations function titjC{ti,tj) 
versus ti+tj using 8 equally sized time windows. We averaged 
over all values which belong to the same ti + tj . The lines are 
given from the first two terms of Eq. JSJ with the unknown 
variables obtained from a least mean-square fit. 



Distinguishing between time-averaged rates and actual 
substitution rates found on today's genome, we have fur- 
thermore been able to show that the latter are signifi- 
cantly larger as the time-averaged ones. Clearly, these 
are the fluctuations one has consider when trying to ex- 
plain the large amount of highly conserved sequences be- 
tween the human and the mouse genome 1 17^. 
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